MEDB 5501, Module 12

2025-11-11

Topics to be covered

  • What you will learn
    • Two factor analysis of variance
    • Relationship to linear regression
    • Checking assumptions
    • R code for two factor analysis of variance
    • Interactions
    • R code for interactions
    • Your homework

Two factor analysis of variance

  • Continuous outcome
  • Two categorical predictors
  • Example
    • Hearing test (decibels at high frequency)
    • Age group (Old or Young)
    • Gender (Female or Male)

Balanced data

  • Equal or proportional numbers in each combination of category levels
    • 3 old females, 3 old males, 3 young females, 3 young males
    • 6 old females, 6 old males, 2 young females, 2 young males

Unbalanced data

  • Unequal numbers in some category combinations
    • 3 old females, 3 old males, 3 young females, 2 young males
  • Extreme case: empty category combinations
    • 3 old females, 3 old males, 3 young females, 0 young males
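
A cross-tabulation is one quick way to see whether a design is balanced, unbalanced, or has empty cells. The sketch below rebuilds the 3, 3, 3, 2 example from the bullets above. The helper name is_proportional is hypothetical, and the check it performs (observed counts equal row total times column total divided by the grand total) is one reasonable reading of "proportional", not an official definition.

```r
# Cell counts from the unbalanced example: 3 old females, 3 old males,
# 3 young females, 2 young males
age    <- c(rep("old", 6), rep("young", 5))
gender <- c(rep("female", 3), rep("male", 3), rep("female", 3), rep("male", 2))
counts <- table(age, gender)
print(counts)

# Hypothetical helper: TRUE when every cell count equals
# (row total * column total) / grand total
is_proportional <- function(counts) {
  expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
  isTRUE(all.equal(as.vector(counts), as.vector(expected)))
}

equal_n   <- length(unique(as.vector(counts))) == 1  # same n in every cell?
any_empty <- any(counts == 0)                        # any empty combination?
balanced  <- is_proportional(counts)
print(c(equal_n = equal_n, any_empty = any_empty, balanced = balanced))
```

Passing a matrix of counts such as matrix(c(6, 2, 6, 2), nrow = 2) to the same helper covers the 6, 6, 2, 2 case from the balanced data slide.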

Historical perspective on unbalanced data, 1

Historical perspective on unbalanced data, 2

Historical perspective on unbalanced data, 3

Historical perspective on unbalanced data, 4

A simple illustration of the complexities of unbalanced data, 1

A simple illustration of the complexities of unbalanced data, 2

A simple illustration of the complexities of unbalanced data, 3

Mathematical model, 1

  • \(Y_{ijk}\)
    • i = which level of first category
    • j = which level of second category
    • k = which patient within a category combination

Mathematical model, 2

\(\begin{smallmatrix} Age & Gender & Outcome \\ Old & Female & Y_{111} \\ Old & Female & Y_{112} \\ Old & Female & Y_{113} \\ Old & Male & Y_{121} \\ Old & Male & Y_{122} \\ Old & Male & Y_{123} \\ Young & Female & Y_{211} \\ Young & Female & Y_{212} \\ Young & Female & Y_{213} \\ Young & Male & Y_{221} \\ Young & Male & Y_{222} \\ Young & Male & Y_{223}\end{smallmatrix}\)

Mathematical model, 3

  • \(Y_{ijk} = \mu + \alpha_i + \beta_j +\epsilon_{ijk}\)
    • \(i=1,...,a,\ j=1,...,b,\ k=1,...,n\)
    • \(\Sigma \alpha_i=0,\ \Sigma \beta_j=0\)
    • \(\epsilon_{ijk}\) is \(N(0,\sigma)\)
  • \(\bar{Y}_{i..}\) is the average for the ith level of first factor
  • \(\bar{Y}_{.j.}\) is the average for the jth level of second factor
  • \(\bar{Y}_{...}\) is the average for all of the data

Mathematical model, 4

  • \(SS(Total)=\Sigma_i \Sigma_j \Sigma_k\ (Y_{ijk}-\bar{Y}_{...})^2\)
    • df=abn-1
  • \(SS(A)=\Sigma_i\ bn(\bar{Y}_{i..}-\bar{Y}_{...})^2\)
    • df=a-1
  • \(SS(B)=\Sigma_j\ an(\bar{Y}_{.j.}-\bar{Y}_{...})^2\)
    • df=b-1
  • \(SS(Error)=SS(Total)-SS(A)-SS(B)\)
    • df=(abn-1)-(a-1)-(b-1)

Artificial data

# A tibble: 12 × 5
      id age   gender code     db
   <int> <chr> <chr>  <chr> <dbl>
 1     1 old   female of       45
 2     2 old   female of       60
 3     3 old   female of       60
 4     4 old   male   om       65
 5     5 old   male   om       60
 6     6 old   male   om       70
 7     7 young female yf       20
 8     8 young female yf       20
 9     9 young female yf        5
10    10 young male   ym       25
11    11 young male   ym       20
12    12 young male   ym       30

Artificial data with means

# A tibble: 12 × 8
      id age   gender code     db age_mean gender_mean overall_mean
   <int> <chr> <chr>  <chr> <dbl>    <dbl>       <dbl>        <dbl>
 1     1 old   female of       45       60          35           40
 2     2 old   female of       60       60          35           40
 3     3 old   female of       60       60          35           40
 4     4 old   male   om       65       60          45           40
 5     5 old   male   om       60       60          45           40
 6     6 old   male   om       70       60          45           40
 7     7 young female yf       20       20          35           40
 8     8 young female yf       20       20          35           40
 9     9 young female yf        5       20          35           40
10    10 young male   ym       25       20          45           40
11    11 young male   ym       20       20          45           40
12    12 young male   ym       30       20          45           40
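
The mean columns in the table above can be reproduced with a few lines of R. This is a minimal sketch in base R, using a data frame and ave() rather than whatever tibble pipeline produced the original output. The id and code columns are left out because they are not needed for the means.

```r
# Artificial hearing data, typed in from the table above
hearing <- data.frame(
  age    = rep(c("old", "young"), each = 6),
  gender = rep(rep(c("female", "male"), each = 3), times = 2),
  db     = c(45, 60, 60, 65, 60, 70, 20, 20, 5, 25, 20, 30)
)

# ave() returns the group mean repeated on every row of that group
hearing$age_mean     <- ave(hearing$db, hearing$age)
hearing$gender_mean  <- ave(hearing$db, hearing$gender)
hearing$overall_mean <- mean(hearing$db)
print(hearing)
```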

SS(Total)

SS(Age)

SS(Gender)
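
The three sums of squares can be computed directly from the formulas on the "Mathematical model, 4" slide. The sketch below uses base R on the artificial hearing data. Summing the squared deviation of each row's group mean over all 12 rows is the same as multiplying each level's squared deviation by the number of rows at that level (bn or an).

```r
# Artificial hearing data, typed in from the slides
hearing <- data.frame(
  age    = rep(c("old", "young"), each = 6),
  gender = rep(rep(c("female", "male"), each = 3), times = 2),
  db     = c(45, 60, 60, 65, 60, 70, 20, 20, 5, 25, 20, 30)
)

overall_mean <- mean(hearing$db)
age_mean     <- ave(hearing$db, hearing$age)     # 60 or 20 on each row
gender_mean  <- ave(hearing$db, hearing$gender)  # 35 or 45 on each row

ss_total  <- sum((hearing$db  - overall_mean)^2)
ss_age    <- sum((age_mean    - overall_mean)^2)
ss_gender <- sum((gender_mean - overall_mean)^2)
ss_error  <- ss_total - ss_age - ss_gender

# Degrees of freedom with a = 2, b = 2, n = 3
df_error <- (nrow(hearing) - 1) - (2 - 1) - (2 - 1)
ms_error <- ss_error / df_error

print(c(total = ss_total, age = ss_age, gender = ss_gender, error = ss_error))
print(c(F_age = ss_age / ms_error, F_gender = ss_gender / ms_error))
```

These values (5500, 4800, 300, 400, and F ratios of 108 and 6.75) agree with the analysis of variance table on the next slide.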

Analysis of variance table

Analysis of Variance Table

Response: db
          Df Sum Sq Mean Sq F value    Pr(>F)    
age        1   4800  4800.0  108.00 2.595e-06 ***
gender     1    300   300.0    6.75   0.02883 *  
Residuals  9    400    44.4                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Break #1

  • What you have learned
    • Two factor analysis of variance
  • What’s coming next
    • Relationship to linear regression

Create indicator variables

# A tibble: 12 × 6
   age   gender code  i_young i_male    db
   <chr> <chr>  <chr>   <dbl>  <dbl> <dbl>
 1 old   female of          0      0    45
 2 old   female of          0      0    60
 3 old   female of          0      0    60
 4 old   male   om          0      1    65
 5 old   male   om          0      1    60
 6 old   male   om          0      1    70
 7 young female yf          1      0    20
 8 young female yf          1      0    20
 9 young female yf          1      0     5
10 young male   ym          1      1    25
11 young male   ym          1      1    20
12 young male   ym          1      1    30
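
The indicator columns in the table above can be built with logical comparisons. The sketch below uses base R. Fitting a regression on the indicators is an added illustration of why they matter: the intercept becomes the fitted mean for the reference group (old females) and each slope is a difference from that group.

```r
# Artificial hearing data, typed in from the slides
hearing <- data.frame(
  age    = rep(c("old", "young"), each = 6),
  gender = rep(rep(c("female", "male"), each = 3), times = 2),
  db     = c(45, 60, 60, 65, 60, 70, 20, 20, 5, 25, 20, 30)
)

# 1 if the condition holds, 0 otherwise
hearing$i_young <- as.numeric(hearing$age == "young")
hearing$i_male  <- as.numeric(hearing$gender == "male")

# Regression on the indicators
fit <- lm(db ~ i_young + i_male, data = hearing)
print(coef(fit))
```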

Two factor analysis of variance using aov

m1 <- aov(db ~ age + gender, data=hearing)
anova(m1)
Analysis of Variance Table

Response: db
          Df Sum Sq Mean Sq F value    Pr(>F)    
age        1   4800  4800.0  108.00 2.595e-06 ***
gender     1    300   300.0    6.75   0.02883 *  
Residuals  9    400    44.4                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two factor analysis of variance using linear regression, 1

m2 <- lm(db ~ age + gender, data=hearing)
anova(m2)
Analysis of Variance Table

Response: db
          Df Sum Sq Mean Sq F value    Pr(>F)    
age        1   4800  4800.0  108.00 2.595e-06 ***
gender     1    300   300.0    6.75   0.02883 *  
Residuals  9    400    44.4                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
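
The aov and lm fits give the same analysis of variance table, which can be confirmed programmatically. The sketch below rebuilds the hearing data in base R and compares the two tables column by column.

```r
# Artificial hearing data, typed in from the slides
hearing <- data.frame(
  age    = rep(c("old", "young"), each = 6),
  gender = rep(rep(c("female", "male"), each = 3), times = 2),
  db     = c(45, 60, 60, 65, 60, 70, 20, 20, 5, 25, 20, 30)
)

m1 <- aov(db ~ age + gender, data = hearing)
m2 <- lm(db ~ age + gender, data = hearing)

# Compare sums of squares and F ratios from the two fits
same_ss <- isTRUE(all.equal(anova(m1)[["Sum Sq"]],  anova(m2)[["Sum Sq"]]))
same_f  <- isTRUE(all.equal(anova(m1)[["F value"]], anova(m2)[["F value"]]))
print(c(same_ss = same_ss, same_f = same_f))
```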

Two factor analysis of variance using linear regression, 2

tidy(m2)
# A tibble: 3 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)     55        3.33     16.5  0.0000000492
2 ageyoung       -40        3.85    -10.4  0.00000260  
3 gendermale      10.0      3.85      2.60 0.0288      
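
Two things in the coefficient table are worth checking by hand. First, the squared t statistics match the F ratios from the analysis of variance table, which holds here because each factor has one degree of freedom and the design is balanced. Second, the coefficients add up to the fitted mean for each combination of age and gender. The sketch below uses base R (summary() and predict()) rather than tidy(), and the name cells is a hypothetical choice.

```r
# Artificial hearing data, typed in from the slides
hearing <- data.frame(
  age    = rep(c("old", "young"), each = 6),
  gender = rep(rep(c("female", "male"), each = 3), times = 2),
  db     = c(45, 60, 60, 65, 60, 70, 20, 20, 5, 25, 20, 30)
)
m2 <- lm(db ~ age + gender, data = hearing)

# Squared t statistics versus F ratios
t_values <- summary(m2)$coefficients[c("ageyoung", "gendermale"), "t value"]
f_values <- anova(m2)[c("age", "gender"), "F value"]
print(cbind(t_squared = t_values^2, f_value = f_values))

# Fitted mean for each of the four combinations
cells <- expand.grid(age = c("old", "young"),
                     gender = c("female", "male"),
                     stringsAsFactors = FALSE)
cells$fitted <- predict(m2, newdata = cells)
print(cells)
```

For example, the fitted mean for young males is the intercept plus both slopes: 55 - 40 + 10 = 25.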

Break #2

  • What you have learned
    • Relationship to linear regression
  • What’s coming next
    • Checking assumptions

Checking assumptions

  • Normality (Non-normality)
  • Homogeneity (Heterogeneity)
  • Independence (Lack of independence)

Use the boxplot to check assumptions

  • Non-normality if boxplot shows skewness and/or outliers
  • Heterogeneity if boxplot shows large change in variation
  • Draw clustered boxplot to examine every combination of categories
    • Use a\(\times\)b boxplots
  • Independence is checked qualitatively
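
A clustered boxplot with one box per combination of categories can be drawn in base R with a two-term formula. The sketch below is illustrative only: with three observations per group, the artificial hearing data are too small to say much about normality or equal variation.

```r
# Artificial hearing data, typed in from the slides
hearing <- data.frame(
  age    = factor(rep(c("old", "young"), each = 6)),
  gender = factor(rep(rep(c("female", "male"), each = 3), times = 2)),
  db     = c(45, 60, 60, 65, 60, 70, 20, 20, 5, 25, 20, 30)
)

# One box for each of the a x b = 4 combinations of age and gender
bp <- boxplot(db ~ age + gender, data = hearing,
              xlab = "Age and gender", ylab = "Hearing (decibels)")

# Group sizes behind each box
print(setNames(bp$n, bp$names))
```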

Alternatives if assumptions violated

  • There is no simple, widely used analog to Kruskal-Wallis or Mann-Whitney-Wilcoxon for two factors
  • Consider a log transformation when
    • All values are greater than 0
    • Groups with larger means have larger variation
    • Data are skewed right, with outliers only on the high end
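
The mechanics of a log transformation are short. The sketch below applies it to the artificial hearing data purely to show the steps. Nothing in these data suggests that a transformation is needed, and the names m_log and ratios are hypothetical.

```r
# Artificial hearing data, typed in from the slides
hearing <- data.frame(
  age    = rep(c("old", "young"), each = 6),
  gender = rep(rep(c("female", "male"), each = 3), times = 2),
  db     = c(45, 60, 60, 65, 60, 70, 20, 20, 5, 25, 20, 30)
)

# A log transformation requires every value to be greater than 0
stopifnot(all(hearing$db > 0))

# Fit the same two factor model on the log scale
m_log <- lm(log(db) ~ age + gender, data = hearing)

# Back-transformed slopes are ratios of geometric means, not differences
ratios <- exp(coef(m_log)[c("ageyoung", "gendermale")])
print(ratios)
```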

Break #3

  • What you have learned
    • Checking assumptions
  • What’s coming next
    • R code for two factor analysis of variance

Analysis of full moon data

Find the file simon-5501-12-moon.qmd on my github site.

Break #4

  • What you have learned
    • R code for two factor analysis of variance
  • What’s coming next
    • Interactions

What is an interaction

  • Impact of one variable is influenced by a second variable
  • Example, influence of alcohol on sleeping pills
  • Three types of interactions
    • Between two categorical predictors
    • Between a categorical and a continuous predictor
    • Between two continuous predictors
  • Interactions greatly complicate interpretation

Interaction plot

  • X axis, first categorical variable
  • Separate lines for second categorical variable
  • Y axis, average outcome
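
Base R has a function, interaction.plot, that follows this layout. The sketch below applies it to the artificial hearing data, with age on the X axis and a separate line for each gender. It also prints the table of averages that the plot draws, which is named cell_means here as a hypothetical choice.

```r
# Artificial hearing data, typed in from the slides
hearing <- data.frame(
  age    = factor(rep(c("old", "young"), each = 6)),
  gender = factor(rep(rep(c("female", "male"), each = 3), times = 2)),
  db     = c(45, 60, 60, 65, 60, 70, 20, 20, 5, 25, 20, 30)
)

# X axis: age; separate lines: gender; Y axis: average decibels
with(hearing, interaction.plot(x.factor = age, trace.factor = gender,
                               response = db, fun = mean,
                               xlab = "Age", ylab = "Average decibels",
                               trace.label = "Gender"))

# The averages that the plot displays
cell_means <- with(hearing, tapply(db, list(age, gender), mean))
print(cell_means)
```

In these data the male and female lines are parallel, since the gap between males and females is 10 decibels in both age groups.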

Hypothetical interaction plots, 1

  • No interaction
  • Ineffective treatment
  • Boys/girls similar

  • No interaction
  • Ineffective treatment
  • Boys fare better than girls

Hypothetical interaction plots, 2

  • No interaction
  • Effective treatment
  • Boys/girls similar

  • No interaction
  • Effective treatment
  • Boys fare better than girls

Hypothetical interaction plots, 3

  • Significant interaction
  • Harmful treatment in boys
  • Effective treatment in girls

  • Significant interaction
  • Ineffective treatment in boys
  • Effective treatment in girls

Hypothetical interaction plots, 4

  • Significant interaction
  • Girls fare better overall
  • Effective treatment
  • Much more effective in boys

Indicator variable for an interaction

# A tibble: 12 × 7
   age   gender code  i_young i_male i_m_by_y    db
   <chr> <chr>  <chr>   <dbl>  <dbl>    <dbl> <dbl>
 1 old   female of          0      0        0    45
 2 old   female of          0      0        0    60
 3 old   female of          0      0        0    60
 4 old   male   om          0      1        0    65
 5 old   male   om          0      1        0    60
 6 old   male   om          0      1        0    70
 7 young female yf          1      0        0    20
 8 young female yf          1      0        0    20
 9 young female yf          1      0        0     5
10 young male   ym          1      1        1    25
11 young male   ym          1      1        1    20
12 young male   ym          1      1        1    30
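
The interaction indicator in the table above is the product of the two original indicators, so it equals 1 only for young males. A minimal sketch in base R:

```r
# Artificial hearing data, typed in from the slides
hearing <- data.frame(
  age    = rep(c("old", "young"), each = 6),
  gender = rep(rep(c("female", "male"), each = 3), times = 2),
  db     = c(45, 60, 60, 65, 60, 70, 20, 20, 5, 25, 20, 30)
)

hearing$i_young <- as.numeric(hearing$age == "young")
hearing$i_male  <- as.numeric(hearing$gender == "male")

# Product of the two indicators: 1 only when both are 1
hearing$i_m_by_y <- hearing$i_young * hearing$i_male
print(hearing)
```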

Interpretation of intercept and slopes
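
One way to read the intercept and slopes of an interaction model is to compare them with the four group averages. The sketch below fits the interaction model to the artificial hearing data in base R. It assumes the default reference levels (old and female, chosen alphabetically), and the names m3 and cell_means are hypothetical.

```r
# Artificial hearing data, typed in from the slides
hearing <- data.frame(
  age    = rep(c("old", "young"), each = 6),
  gender = rep(rep(c("female", "male"), each = 3), times = 2),
  db     = c(45, 60, 60, 65, 60, 70, 20, 20, 5, 25, 20, 30)
)

m3 <- lm(db ~ age * gender, data = hearing)
b  <- coef(m3)
cell_means <- with(hearing, tapply(db, list(age, gender), mean))

# Intercept: average for the reference group (old females)
# ageyoung: young females minus old females
# gendermale: old males minus old females
# ageyoung:gendermale: how much the male/female gap changes from old to young
by_hand <- c(
  cell_means["old", "female"],
  cell_means["young", "female"] - cell_means["old", "female"],
  cell_means["old", "male"] - cell_means["old", "female"],
  (cell_means["young", "male"] - cell_means["young", "female"]) -
    (cell_means["old", "male"] - cell_means["old", "female"])
)
print(cbind(coefficient = b, by_hand = by_hand))
```

In these artificial data the interaction term is zero apart from rounding error, because the male/female gap is the same in both age groups.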

When you can’t estimate an interaction

  • Special case, n=1
    • Only one observation per category combination

Example, full moon study, 1 of 2

# A tibble: 36 × 3
   month1 moon1      n
   <fct>  <fct>  <int>
 1 Aug    Before     1
 2 Aug    During     1
 3 Aug    After      1
 4 Sep    Before     1
 5 Sep    During     1
 6 Sep    After      1
 7 Oct    Before     1
 8 Oct    During     1
 9 Oct    After      1
10 Nov    Before     1
# ℹ 26 more rows

Example, full moon study, 2 of 2

m1 <- aov(admission ~ month*moon, data=er)
anova(m1)
Analysis of Variance Table

Response: admission
           Df Sum Sq Mean Sq F value Pr(>F)
month      11 455.58  41.417     NaN    NaN
moon        2  41.51  20.757     NaN    NaN
month:moon 22 127.82   5.810     NaN    NaN
Residuals   0   0.00     NaN               
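
The NaN values appear because the interaction model uses up every degree of freedom: 36 observations, and 1 + 11 + 2 + 22 = 36 parameters. The same thing happens in any design with one observation per combination. The sketch below shows it on a smaller scale, keeping only the first observation in each group of the artificial hearing data. The names one_per_cell, full, and additive are hypothetical.

```r
# First observation from each of the four groups in the artificial hearing data
one_per_cell <- data.frame(
  age    = c("old", "old", "young", "young"),
  gender = c("female", "male", "female", "male"),
  db     = c(45, 65, 20, 25)
)

# With the interaction: 4 observations, 4 parameters, nothing left for error
full <- lm(db ~ age * gender, data = one_per_cell)

# Without the interaction: 4 observations, 3 parameters, 1 df for error
additive <- lm(db ~ age + gender, data = one_per_cell)

print(c(full = df.residual(full), additive = df.residual(additive)))
```

By the same arithmetic, dropping the interaction from the full moon model (month + moon) would move those 22 degrees of freedom into the residuals.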

Break #5

  • What you have learned
    • Interactions
  • What’s coming next
    • R code for interactions

Analysis of fruitfly data

Find the file simon-5501-12-fruitfly.qmd on my github site.

Break #6

  • What you have learned
    • R code for interactions
  • What’s coming next
    • Your homework

Your homework

Find the file simon-5501-12-directions.md on my github site.

Summary

  • What you have learned
    • Two factor analysis of variance
    • Relationship to linear regression
    • Checking assumptions
    • R code for two factor analysis of variance
    • Interactions
    • R code for interactions
    • Your homework